POLI 572B

Michael Weaver

March 7, 2022

Inference with Least Squares

Objectives

Recap

  • Least Squares as a predictive algorithm

Inference

  • Least Squares as an estimator of some unknown parameters

Today

Least Squares

  • statistical model for inference
    • what is the statistical model? (like, e.g. Neyman Causal Model)
    • what are the assumptions we need for…
      • unbiasedness of estimator
      • deriving uncertainty of estimate

Recap

Least Squares for Prediction

So far, we…

  • …start with a vector \(\mathbf{Y} = (Y_1 \ldots Y_i \ldots Y_n)\)

  • … want to produce a vector of predicted values \(\mathbf{\widehat{Y}} = (\widehat{Y_1} \ldots \widehat{Y_i} \ldots \widehat{Y_n})\)

  • … defined as \(\mathbf{\widehat{Y}} = \mathbf{X}\mathbf{\beta}\):

    • where \(\mathbf{X}\) is a matrix of predictors (for the mean, just a vector of \(1\)s) and \(\mathbf{\beta}\) is a vector of linear coefficients that are multiplied by \(\mathbf{X}\) such that \(\widehat{Y}_i = b_0x_{i0} + b_1x_{i1} + \ldots + b_px_{ip}\)

Least Squares for Prediction

Then, we…

  • … choose \(\beta\) such that \(\mathbf{\widehat{Y}}\) has minimum (Euclidean) distance to \(\mathbf{Y}\):

    • minimize the length of \(\mathbf{e} = \mathbf{Y} - \mathbf{\widehat{Y}}\), where \(\mathbf{e}\) is the vector of residuals (prediction errors) \((Y_1 - \hat{Y_1}, \ldots , Y_i - \hat{Y_i}, \ldots, Y_n - \hat{Y}_n)\)
  • the sum of squared residuals \(\mathbf{e}'\mathbf{e}\) is minimized when \(\mathbf{e}'\mathbf{X} = \mathbf{0}\): residuals are orthogonal to \(\mathbf{X}\).

This gave us \(\mathbf{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
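A minimal sketch of this computation in R, using simulated data (the variable names and values are illustrative, not from the lecture):

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# Design matrix: a column of 1s (intercept) plus the predictors
X <- cbind(1, x1, x2)

# Least squares coefficients: (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Residuals are orthogonal to X (cross-products ~ 0), so with an
# intercept the residuals also have mean ~ 0
e <- y - X %*% beta_hat
t(e) %*% X
mean(e)

# lm() gives the same coefficients
coef(lm(y ~ x1 + x2))
```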

Least Squares for Prediction

This definition of least squares means that:

  • residuals \(\mathbf{e}\) are orthogonal to \(\mathbf{X}\); so if there is an intercept, the mean of the residuals is \(0\), and covariance of residuals and \(\mathbf{X}\) is \(0\).

  • If we just fit an intercept (same value of \(\widehat{Y}\) for all cases): what is the mean of the residuals \(E(Y_i - \widehat{Y_i})\)?

Example

Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.

Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.

  • UHRSWORK (the usual hours worked per week)
  • FEMALE (an indicator for female)
  • AGE (in years)
  • LAW (an indicator for being a lawyer)
  • INCEARN (total annual income earned in dollars).

Example

Let’s now find the (approximate) linear conditional expectation function of earnings, across hours worked, age, profession, and gender:
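In R, this is a single lm() call (shown here assuming the ACS extract has been read into a data frame named acs_data, the name that appears in the Call line of the output below):

```r
fit <- lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
summary(fit)
```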

Example

## 
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -304486  -90963  -29238   72845  743690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11083.5     9484.6   1.169    0.243    
## UHRSWORK      1075.5      114.7   9.373   <2e-16 ***
## AGE           3452.8      125.2  27.578   <2e-16 ***
## LAW         -48919.9     2674.6 -18.291   <2e-16 ***
## FEMALE      -37223.1     2799.8 -13.295   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127800 on 9995 degrees of freedom
## Multiple R-squared:  0.146,  Adjusted R-squared:  0.1457 
## F-statistic: 427.2 on 4 and 9995 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?

Example

[Figures omitted in this text version: plots of the model residuals against hours worked, separately by gender.]

Looking at the residuals, gender and hours worked have a non-linear relationship with earnings. How can we turn this non-linear relationship into a linear relationship?

Angrist and Pischke, MHE, pp. 77-80

  • transform continuous variables into dummies: dummies can only have a linear relationship (a sketch in R follows this list)
  • this ensures that least squares correctly captures the relationship between variables
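One way to do this in R is to cut the continuous variable into bins so that lm() estimates a separate dummy coefficient for each bin. This sketch reuses the acs_data example from above; the bin breakpoints are arbitrary choices for illustration:

```r
# Turn usual hours worked into a categorical variable; lm() will create
# a dummy for each bin (relative to the omitted first bin), so no
# functional form is imposed on the hours-earnings relationship
acs_data$HOURS_BIN <- cut(acs_data$UHRSWORK,
                          breaks = c(0, 20, 35, 45, 60, Inf))

fit_dummies <- lm(INCEARN ~ HOURS_BIN + AGE + LAW + FEMALE, data = acs_data)
summary(fit_dummies)
```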

Inference

Regression for Inference:

  1. We want to interpret the \(\beta\) vector causally.
     • typically just one coefficient of causal interest: why? (think about DAGs)
  2. There is a true \(\beta_D\) (\(\delta\)) that is the effect of some \(D\) on \(Y\).
     • either a true linear effect, or weighted average of non-linear effects
  3. \(\beta_D\) is an unknown parameter.

  4. To estimate \(\beta_D\), we need a statistical model for least squares

A Statistical Model

As with experiments (e.g. the Neyman causal model), statistical models

  • treat observed data as realizations of random variables (some chance process)
  • with some assumptions…
  • permit us to connect what we observe to unobserved parameters

Need to clarify what work is being done by

  • data
  • regression mathematics
  • assumptions of statistical model

An example: Hooke’s Law

Hooke’s law is a law of physics that states that the force (\(D\)) needed to extend or compress a spring by some distance \(y\) increases linearly with respect to the distance.

Thus, for a given weight \(D_i\) on a spring, the length \(Y_i\) is determined by:

\(Y_i = \alpha + \beta D_i + \epsilon_i\) for \(i\) in \(1 \ldots n\)

An example: Hooke’s Law

\(Y_i = \alpha + \beta D_i + \epsilon_i\) for \(i\) in \(1 \ldots n\)

This is a statistical model:

  • \(\alpha\) and \(\beta\) are unknown parameters (thus, greek letters, not \(a\) and \(b\)). They are fixed.
    • \(\alpha\) is the length of the spring in the absence of force
    • \(\beta\) is the change in the length of the spring per unit of force
  • \(\epsilon_i\) is an unknown random error (a random variable). Each \(\epsilon_i\) is…
    • independent of \(D_i\)
    • each drawn from the same distribution with mean \(0\) and variance \(\sigma^2\) (another parameter)

We use observed data to estimate the parameters \(\alpha, \beta, \sigma^2\)
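A small simulation of this model (parameter values chosen only for illustration): generate lengths from the model, then use least squares to recover the parameters.

```r
set.seed(42)
alpha <- 5      # spring length with no force (cm) -- illustrative value
beta  <- 0.2    # change in length per Newton (cm)  -- illustrative value
sigma <- 0.05   # sd of the random error            -- illustrative value

D <- rep(c(0, 1, 2, 4, 8), each = 10)              # applied forces (Newtons)
epsilon <- rnorm(length(D), mean = 0, sd = sigma)  # i.i.d. errors, mean 0
Y <- alpha + beta * D + epsilon                    # lengths implied by the model

fit <- lm(Y ~ D)
coef(fit)           # estimates of alpha and beta
summary(fit)$sigma  # estimate of sigma
```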

In groups:

Imagine we have a spring whose length is 3 cm. With each Newton of force applied, the spring’s length increases by 0.5 cm.

Using these facts and the statistical model described above: create a response schedule/potential outcomes table corresponding to the spring under loads of \(0,1,2,3,5,8\) Newtons.

\(Y_i(D_i = 0), Y_i(D_i = 1), Y_i(D_i = 2), \ldots\)

The Regression Model:

The statistical model assumes that this linear equation describes the process generating the data.

\[Y_i = \alpha + \beta D_i + \epsilon_i\]

deterministic (not random): \(\alpha + \beta D_i\)

stochastic (random error): \(\epsilon_i\)

The Regression Model:

\(\alpha\): parameter for the intercept (fixed for all cases)

\(\beta\): parameter for the slope (fixed for all cases)

\(D_i\): value of independent variable (input into the equation)

\(Y_i\): value of dependent variable (determined by the equation)

\(\epsilon_i\): random error for observation \(i\)

  • each \(\epsilon_i\) is a realization of the same random variable with mean \(0\) and variance \(\sigma^2\).
  • the \(\epsilon_i\) are independent of each other and of \(D_i\)

In the context of Hooke’s law and other physical laws, it may make sense to expect the effect \(\beta_D\) to be linear and fixed for all observations.

Think back to experiments: in context of political science, does this make sense?

  • e.g., is the effect of income on preferences for redistribution the same for all people? Linear?

The Regression Model:

It is possible to think of regression estimates like this:

  • \(\beta_D\): is weighted average of non-linear effect of \(D\) on \(Y\) across different values of \(D\)
  • \(\beta_D\): can represent the average effect, even if there are heterogeneous unit causal effects

\[Y_i = \alpha_i + \beta_{D_i} D_i + \nu_i\]
\[Y_i = \alpha_i + (\beta_{D} + (\beta_{D_i} - \beta_{D})) D_i + \nu_i\]
\[Y_i = \alpha_i + \beta_{D} D_i + (\nu_i + (\beta_{D_i} - \beta_{D})D_i)\]

… if heterogeneous effects are independent of \(D\)
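A quick simulation of this point (all values illustrative): when unit-specific effects are independent of \(D\), the least squares slope recovers their average.

```r
set.seed(7)
n <- 100000
D      <- rnorm(n)                    # treatment / predictor
beta_i <- rnorm(n, mean = 2, sd = 1)  # heterogeneous unit effects, independent of D
nu     <- rnorm(n)                    # error term
Y <- 1 + beta_i * D + nu              # each unit has its own slope

coef(lm(Y ~ D))["D"]   # close to mean(beta_i), i.e. about 2
```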

The Regression Model:

The more general regression model (where \(\mathbf{X_i}\) is a row of the matrix of predictor variables \(\mathbf{X}\))

\[Y_i = \mathbf{X_i\beta} + \epsilon_i\]

can be estimated with least squares:

\[Y_i = X_i\hat{\beta} + e_i\]

where the residuals are used to estimate the variance of the errors: \(\widehat{Var}(e) = \widehat{\sigma^2}\)

The Regression Model:

Differences from Least Squares Algorithm

  • we now have estimates (written with hats, e.g. \(\widehat{\beta}\)) of parameters
  • the model equation has unobserved errors \(\epsilon_i\) in place of the residuals \(e_i\) (the two are not the same thing)

Least Squares is the Best Linear Unbiased Estimator of our model parameters under 3 key assumptions.

  • by unbiased we mean \(\beta - E(\widehat{\beta}) = 0\)
  • “best” means the estimate has the least variance among linear unbiased estimators (not so important in practice)

Model Assumptions:

\((1)\) Model equation is correctly specified

  • correct functional form (e.g., linear, additive) for process by which variables in \(X\) affect \(Y\), and how variables in \(X\) affect each other.
  • all relevant variables affecting \(Y\) (will define next week) are included in the model
  • \(Y_i = \mathbf{X_i}\mathbf{\beta} + \epsilon_i\) is the true data-generating function

In groups: come up with 2 ways this assumption could be wrong in the context of this model:

\(Earn_i = b_0 + b_1 Hours_i + b_2 Age_i + b_3 Law_i + b_4 Female_i + \epsilon_i\)

Model Assumptions:

\((2)\) \(\epsilon_i\) are independent, identically distributed (i.i.d.)

  • \(E(\epsilon_i) = 0\) and \(Var(\epsilon_i) = \sigma^2\).
  • homoskedastic
    • errors \(\epsilon\) all come from same distribution with given variance
  • in contrast to heteroskedastic:
    • errors come from different distributions with different variances.

In groups: come up with 2 ways this assumption could be wrong in the context of the Hooke’s Law example.

Model Assumptions:

\((3)\) \(\epsilon_i\) is independent of \(X_i\): \(X_i \perp \!\!\! \perp \epsilon_i\)

  • This means \(X\) is independent of \(\epsilon\) (recall dependence versus independence)
  • independence is not the same as orthogonality: it is a stronger condition (independence implies zero correlation, but not the reverse)

In groups: come up with 2 ways this assumption could be wrong in the context of this model:

\(Earn_i = b_0 + b_1 Hours_i + b_2 Age_i + b_3 Law_i + b_4 Female_i + \epsilon_i\)

Model Assumptions:

Under these three assumptions, we call the model “ordinary least squares”.

We cannot empirically verify these assumptions.

In particular, cannot verify assumption 3:

  • When using least squares to estimate this model, we mathematically impose \(e_i\) to be orthogonal to \(x_i\) and \(E(e_i) = 0\). This is a mathematical feature of regression, not proof that \(X_i \perp \!\!\! \perp \epsilon_i\)

Model Assumptions:

If these assumptions are reasonable for the data we observe, then with large \(n\), we can estimate our parameters \(\beta, \sigma^2\) using least squares.

And \(\beta\) could have a causal interpretation.

Estimating the Regression Model

The Regression Model:

What work is done by

  • the data?
  • the math of regression?
  • the assumptions?

The Regression Model:

Mathematical Assumptions (least squares):

  • design matrix has full rank (no linear dependence between columns)
  • \(n \geq p\)

Statistical Assumptions (statistical model):

  • \(Y_i\) is generated by \(\mathbf{X_i\beta} + \epsilon_i\)
  • \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\)
  • \(X_i \perp \!\!\! \perp \epsilon_i\)

Estimation:

If \(\mathbf{X}\) is an \(n \times p\) matrix, we want to estimate \(p + 1\) parameters… (what are they?)

\(\beta\) estimated by \(\widehat{\beta}\) using least squares:

\[\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

  • Note: \(\widehat{\beta}\) is a random vector; each element is a random variable. Why?

\(Y_i = X_i\beta + \epsilon_i\)

\(\epsilon_i\) is random variable.

If we were to repeat the drawing of \(\epsilon\) again, we would get different estimates: \(\widehat{\beta}\).
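A short illustration (simulated data, values chosen arbitrarily): hold \(X\) fixed, redraw \(\epsilon\), and the least squares estimates change.

```r
set.seed(3)
n <- 50
X    <- cbind(1, runif(n, 0, 10))   # fixed design matrix: intercept + one predictor
beta <- c(2, 0.5)                   # true parameters (assumed for the simulation)

# Two independent draws of the errors give two different estimates
for (draw in 1:2) {
  epsilon <- rnorm(n)
  Y <- X %*% beta + epsilon
  print(drop(solve(t(X) %*% X) %*% t(X) %*% Y))
}
```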

Regression and Random Variables

Compare this against the logic of experiments:

  • in experiments, random assignment makes mean observed potential outcomes of \(Y\) into random variables.
  • in regression, what is the “chance process” generating \(\epsilon\)?
    • Doesn’t really make a lot of sense; even recent textbooks hand-wave and say it isn’t important to interpret this; the math just works
    • A big open question, because \(\epsilon\) drives our variance estimates for \(\widehat{\beta}\)

Estimation:

Under the statistical assumptions we made, \(\widehat{\beta}\) is an unbiased estimator of \(\beta\), conditional on \(X\): \(E(\widehat{\beta}|X) = \beta\)

  • “conditional on \(X\)” means taking the values of \(X\) as fixed; not a sample from a population

LS estimate is unbiased:

By assumption: \(Y = \mathbf{X\beta} + \epsilon\), so:

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X\beta} + \epsilon)\)

\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

LS estimate is unbiased:

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)

So:

\(E(\widehat{\beta}|X) = E(\mathbf{\beta}|\mathbf{X}) + E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon|\mathbf{X})\)

\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon|\mathbf{X})\)

And because, by assumption, \(X \perp \!\!\! \perp \epsilon\)

\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon)\)

And because, by assumption, \(E(\epsilon) = 0\)

\(E(\widehat{\beta}|X) = \mathbf{\beta}\)
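This can be checked numerically: holding \(X\) fixed and redrawing \(\epsilon\) many times, the average of the estimates sits very close to the true \(\beta\) (a simulation sketch with assumed values):

```r
set.seed(8)
n <- 50
X    <- cbind(1, runif(n, 0, 10))   # fixed design matrix
beta <- c(2, 0.5)                   # true parameters (assumed)

# Redraw epsilon 5000 times; re-estimate beta each time
estimates <- replicate(5000, {
  Y <- X %*% beta + rnorm(n)
  drop(solve(t(X) %*% X) %*% t(X) %*% Y)
})

rowMeans(estimates)   # approximately (2, 0.5): E(beta_hat | X) = beta
```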

Variance of LS Estimator?

Thus, if our model assumptions are correct, \(\widehat{\beta}\) is an unbiased estimate of causal parameter \(\beta\)

  • \(\widehat{\beta}\) is a random vector, as it depends on chance draws of \(\epsilon\)
  • As \(n \to \infty\), \(\widehat{\beta} \approx \beta\), but there is still uncertainty.

How do we discover the uncertainty in our estimate \(\widehat{\beta}\)?

Variance of our Estimate

Variance of OLS Estimator:

  • Each element of \(\widehat{\beta}_{p \times 1}\) is a random variable
    • each \(\widehat{\beta}_k\) has a variance
    • and a covariance with other \(\widehat{\beta}_{j\neq k}\)
  • These variances and covariances are found in the variance-covariance matrix of \(\widehat{\beta}\)
    • a \(p \times p\) matrix
    • Diagonal elements are the variances for \(\widehat{\beta}_{1} \dots \widehat{\beta}_p\)
    • off-diagonal elements are covariances;
    • matrix is symmetric

Variance-Covariance Matrix

\[\scriptsize{\begin{pmatrix} Var(\widehat{\beta}_1) & Cov(\widehat{\beta}_1,\widehat{\beta}_2) & Cov(\widehat{\beta}_1,\widehat{\beta}_3) &\ldots & Cov(\widehat{\beta}_1,\widehat{\beta}_p) \\ Cov(\widehat{\beta}_2,\widehat{\beta}_1) & Var(\widehat{\beta}_2) & Cov(\widehat{\beta}_2,\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_2,\widehat{\beta}_p) \\ Cov(\widehat{\beta}_3,\widehat{\beta}_1) & Cov(\widehat{\beta}_3,\widehat{\beta}_2) & Var(\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_3,\widehat{\beta}_p) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\widehat{\beta}_p,\widehat{\beta}_1) & Cov(\widehat{\beta}_p,\widehat{\beta}_2) & Cov(\widehat{\beta}_p, \widehat{\beta}_3) & \ldots & Var(\widehat{\beta}_p)\end{pmatrix}}\]

Variance of OLS Estimator:

How do we use the variance-covariance matrix?

  • The square root of the diagonal elements (the variances) gives the standard error for each parameter estimate in \(\widehat{\beta}\) (hypothesis testing)

  • The off-diagonal elements help answer questions like whether \(\beta_2 + \beta_3 \neq 0\): we need \(Cov(\widehat{\beta}_2, \widehat{\beta}_3)\) to get \(Var(\widehat{\beta}_2 + \widehat{\beta}_3)\) (complex hypothesis testing, e.g. interaction effects); see the sketch below
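In R, vcov() returns this matrix for a fitted model. A sketch of both uses, with a built-in dataset standing in for a real application:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model

V <- vcov(fit)     # estimated variance-covariance matrix of beta_hat
sqrt(diag(V))      # standard errors, as reported by summary(fit)

# Variance of a sum of two coefficients needs the off-diagonal covariance:
# Var(b_wt + b_hp) = Var(b_wt) + Var(b_hp) + 2 * Cov(b_wt, b_hp)
V["wt", "wt"] + V["hp", "hp"] + 2 * V["wt", "hp"]
```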

Variance-Covariance Matrix

\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:

\[cov(\widehat{\beta}|X) = E((\widehat{\beta} - \beta)(\widehat{\beta} - \beta)' | X)\]

\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]

\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]

\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

What really matters here is: \(E(\epsilon\epsilon'|X)\)

Variance-Covariance Matrix

All variance-covariance matrices are “sandwiches”:

\[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the bread; \(E(\epsilon\epsilon'|X)\) is the meat.

Later on, we can make different choices of “meat”.

Variance-Covariance Matrix

In groups: if \(\epsilon\) is a vector that is \(n \times 1\) with elements \((\epsilon_1 \ldots \epsilon_n)\)

  1. What are the dimensions of the matrix: \(\epsilon\epsilon'\)?

  2. What are the elements on the diagonal of the matrix?

  3. What is the expected value of the elements on the off-diagonal?

Hints: We have assumed that \(E(\epsilon) = 0\); \(\epsilon_i\) are independent of each other

Variance-Covariance Matrix

\[E(\epsilon\epsilon'|X) = \sigma^2\mathbf{I}_{n\times n}\]

\(\epsilon\epsilon'\) is an \(n \times n\) matrix whose \(ij\)th element is \(\epsilon_i\epsilon_j\)

  • this is the variance covariance matrix of the errors (we don’t subtract means because \(E(\epsilon) = 0\))

Variance-Covariance Matrix

Taking the expectation:

  • If \(i \neq j\), the elements of the matrix are \(E(\epsilon_i\epsilon_j)\). Because, by assumption, \(E(\epsilon) = 0\), and \(\epsilon\)s are independent, \(E(\epsilon_i\epsilon_j) = E(\epsilon_i)E(\epsilon_j) = 0 \times 0 = 0\)
  • but, for \(i = j\), \(E[\epsilon_i^2] = E[(\epsilon_i - E[\epsilon])^2]\) (a variance)
  • and because the \(\epsilon_i\) are identically distributed, \(Var(\epsilon_i) = Var(\epsilon) = \sigma^2\) for all \(i\)

The result is an \(n \times n\) matrix, with \(\sigma^2\) on the diagonal and \(0\) everywhere else.

Variance-Covariance Matrix

\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}_{n\times n}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[ = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]

\[cov(\widehat{\beta}|X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]

This gives us the variance-covariance matrix of \(\widehat{\beta}\). Can we start computing standard errors for \(\widehat{\beta}\) now?

Variance-Covariance Matrix

We use residuals to stand in for \(\epsilon\).

\[\widehat{\sigma^2} = \frac{\sum\limits_i^n e_i^2}{n - p}\]

\(E[\widehat{\sigma^2} | X] = \sigma^2\) (unbiased)

  • Why \(n-p\)? The variance of the sample mean required \(n-1\), because we calculated the mean first, so we lost 1 degree of freedom. With bivariate regression (an intercept and one slope), we lose 2 df. With \(p\) estimated coefficients, we lose \(p\) degrees of freedom.

Variance-Covariance Matrix

so, we estimate the variance-covariance matrix:

\(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\)
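Putting these pieces together by hand on simulated data; the result matches what vcov() reports for the same model:

```r
set.seed(5)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 3)

X <- cbind(1, x1, x2)
p <- ncol(X)

beta_hat   <- solve(t(X) %*% X) %*% t(X) %*% y
e          <- y - X %*% beta_hat
sigma2_hat <- sum(e^2) / (n - p)               # estimate of the error variance
V_hat      <- sigma2_hat * solve(t(X) %*% X)   # estimated var-cov matrix of beta_hat

sqrt(diag(V_hat))                  # standard errors "by hand"
sqrt(diag(vcov(lm(y ~ x1 + x2))))  # same numbers from lm()
```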

Standard Errors of \(\widehat{\beta_p}\)

The variance (\(p\)th diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is

\(Var(\hat{\beta_p}) = \frac{\hat{\sigma^2}}{\sum\limits_i^n (X^*_{ip} - \overline{X_p^*})^2}\)

where \(X^*_{p}\) denotes \(X_p\) after partialling out the other predictors (its residual variation).

So the standard error is:

\(SE(\hat{\beta_p}) = \sqrt{Var(\hat{\beta_p})} = \frac{\widehat{\sigma}}{\sqrt{n}\sqrt{Var(X_p^*)}}\)

  • standard errors get smaller (estimates more precise) with: larger \(n\), more (residual) variation in \(X_p\), and smaller error variance (\(\sigma^2\))

Sampling Distribution of \(\hat{\beta}\)

asymptotic normality:

Even if \(\epsilon\) is not normally distributed, under our assumptions the sampling distribution of \(\hat{\beta}\) becomes normal as \(n \to \infty\), by the central limit theorem. This is a reasonable approximation if \(n\) is moderately large and the data are not highly skewed.

Because we estimate the standard errors, the test statistic for hypotheses about \(\widehat{\beta}\) follows a \(t\) distribution with \(n - p\) degrees of freedom:

\[t = \frac{\widehat{\beta}_k - \beta_{k,0}}{\widehat{SE}(\widehat{\beta}_k)}\]

where \(\beta_{k,0}\) is the hypothesized value (usually \(0\)).
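For instance, the t values and p-values printed by summary() can be reproduced from the coefficients, their standard errors, and the residual degrees of freedom (a sketch on a built-in dataset):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model

b      <- coef(fit)
se     <- sqrt(diag(vcov(fit)))
t_stat <- b / se                          # test of H0: beta_k = 0
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

cbind(t_stat, p_val)   # matches the "t value" and "Pr(>|t|)" columns of summary(fit)
```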

Standard errors

In sum:

  • We get \(\widehat{\sigma^2}\) from the residuals (\(e\)): \(\widehat{\sigma^2} = \sum_i e_i^2 / (n - p)\)
  • We take the diagonal of the variance-covariance matrix \(\widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\) to get the variances of the coefficients
  • We take the square root of these variances to get the standard errors of \(\widehat{\beta}\)

Standard errors

For standard errors of \(\widehat{\beta}\) to be correct, we must assume:

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\) for all \(i\)
  3. \(X_i \perp \!\!\! \perp \epsilon_i\)

If any assumption is wrong: standard errors will be incorrect.

If assumption 1 or 3 is wrong: unbiasedness of \(\widehat{\beta}\) does not hold.

In groups:

Download https://pastebin.com/LeSpqKKk

  1. Regress gop_change on enlistment and an intercept using matrix calculations
  2. Create the residuals e
  3. Create the variance-covariance matrix
  4. Get the coefficient and standard error for enlistment
  5. For each of the three least squares assumptions:
  • what is an example of the assumption being wrong in this case?
  • why would it lead standard errors to be wrong?

OLS Assumptions:

When do things go wrong?

  • We incorrectly specify the model (it is multiplicative, or we forgot a term)
    • \(Y_i = \alpha + \beta_1 x + \beta_2 x^2 + \epsilon\)
    • \(Y_i = \alpha + \beta_1 x + \beta_2 z + \epsilon\)
  • Errors are heteroskedastic: errors have different variances.
    • E.g.: at higher levels of education, incomes take on a wider range of values (think PhD vs. MBA students’ earning potential), so the errors are larger.

OLS Assumptions:

  • Errors are not independent: errors for different observations are correlated.
    • We observe the same counties over time. The error we observe for county \(i\) at time \(t\) is likely to be correlated with the error of county \(i\) at time \(t+1\), because it is the same place.
  • \(\epsilon\) is not independent of \(X\): the errors are in fact correlated with \(X\).
    • This is “confounding”. Our estimate \(\hat{\beta}\) of the coefficient on \(x\) will be biased. If we incorrectly excluded \(z\) from the equation, the bias in \(\hat{\beta}\) will be a function of the relationships between \(x\) and \(z\), and between \(z\) and \(y\).

Conclusion:

  1. \(Y\) is generated by \(\mathbf{X\beta} + \epsilon\)
  2. \(\epsilon_i\) are independent, identically distributed, with variance \(\sigma^2\) for all \(i\)
  3. \(X_i \perp \!\!\! \perp \epsilon_i\)

With these assumptions:

  • \(\widehat{\beta}\) is unbiased estimator for \(\beta\)
  • known sampling distribution of \(\widehat{\beta}\)